
Experiment tracking #86

Merged
AymenFJA merged 7 commits into main from feature/experiment_tracking
Mar 12, 2026

Conversation

@AymenFJA
Collaborator

@AymenFJA AymenFJA commented Mar 10, 2026

This PR introduces experiment tracking capabilities:

  1. Core: vanilla file-based tracker.
  2. Integration: ClearML.
  3. Integration: MLflow.

This PR only adds ROSE tracking capabilities at the learner level, specifically in the outer loop of the learning approach.

MLflow

[screenshots: MLflow tracking UI]

ClearML

[screenshot: ClearML tracking UI]

AymenFJA added 3 commits March 9, 2026 21:34
1-Base tracking capabilities
2-Plug-and-play (PNP) approach for trackers
3-Native file-based exp. tracker
4-MLflow exp. tracker
5-ClearML exp. tracker
6-Extend examples
7-Extend docs
1-Test suite for tracking
2-Examples for tracking
3-Update docs
1-Modify the approach so it aligns with ROSE working on the OUTER loop of the learning, not the INNER loop
2-Update Docs
3-Update Examples
4-Update Tests
@AymenFJA AymenFJA self-assigned this Mar 10, 2026
@AymenFJA AymenFJA added raas ROSE As A Service Q1 labels Mar 10, 2026
@AymenFJA AymenFJA changed the title Feature/experiment tracking Experiment tracking Mar 10, 2026
@gemini-code-assist

Warning

Gemini is experiencing higher than usual traffic and was unable to create the summary. Please try again in a few hours by commenting /gemini summary.

@AymenFJA
Collaborator Author

/gemini summary

@gemini-code-assist

Summary of Changes

This pull request introduces a robust and flexible experiment tracking system to ROSE, significantly enhancing its observability capabilities. By implementing a pluggable TrackerBase protocol, it cleanly separates experiment logging concerns from the core learning algorithms. This allows users to effortlessly integrate popular platforms like ClearML and MLflow, or utilize a simple file-based logger, with minimal code changes. The system automatically captures pipeline configurations, iteration metrics, and stop reasons, promoting reproducibility and providing richer insights into learning workflows, particularly beneficial in HPC environments prone to job preemption.

Highlights

  • Pluggable Experiment Tracking System: Introduced a new TrackerBase protocol and PipelineManifest to enable a flexible and extensible experiment tracking system, allowing automatic logging of pipeline events without modifying the core learning loop.
  • ClearML Integration: Added a native integration for ClearML via ClearMLTracker, providing seamless experiment tracking with ClearML, including support for parallel learner comparison.
  • MLflow Integration: Implemented a native integration for MLflow via MLflowTracker, enabling automatic logging of parameters, metrics, and tags, and deprecating the previous manual wiring approach.
  • HPC-Safe File-Based Tracking: Included a basic HPC_FileTracker for robust, append-only, and preemption-safe logging of experiment data to JSON Lines files, suitable for High-Performance Computing environments.
  • Documentation and Examples: Updated the documentation with a comprehensive user guide for the new tracking system and detailed guides for ClearML and MLflow integrations, alongside new example scripts demonstrating the add_tracker() approach.
  • Core Learner Refactoring: Refactored SequentialActiveLearner, ParallelUQLearner, and SequentialRLLearner classes to incorporate tracker notification callbacks at key lifecycle points, ensuring consistent and automatic event dispatching.
  • Dependency Management: Added optional dependencies for mlflow and clearml in pyproject.toml, allowing users to install only the required tracking integrations.
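The highlights above describe a protocol-based design in which learners dispatch events to registered trackers. As a rough illustration of how such a pluggable interface can work, here is a minimal, self-contained sketch using `typing.Protocol`. The method names `on_iteration`/`on_stop` are assumptions loosely based on the review comments in this PR (`on_stop(final_state, reason)` appears there), and `MiniLearner` is a toy stand-in, not ROSE's actual learner code.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TrackerBase(Protocol):
    """Illustrative tracker protocol: any object with these methods qualifies."""

    def on_iteration(self, state: dict[str, Any]) -> None: ...
    def on_stop(self, final_state: dict[str, Any] | None, reason: str) -> None: ...


class ListTracker:
    """Toy tracker that records every event in memory."""

    def __init__(self) -> None:
        self.events: list[tuple[str, Any]] = []

    def on_iteration(self, state: dict[str, Any]) -> None:
        self.events.append(("iteration", state))

    def on_stop(self, final_state: dict[str, Any] | None, reason: str) -> None:
        self.events.append(("stop", reason))


class MiniLearner:
    """Learner-style dispatcher: notifies all registered trackers."""

    def __init__(self) -> None:
        self._trackers: list[TrackerBase] = []

    def add_tracker(self, tracker: TrackerBase) -> None:
        self._trackers.append(tracker)

    def run(self, max_iter: int) -> None:
        state: dict[str, Any] = {}
        for i in range(max_iter):
            state = {"iteration": i, "metric": 1.0 / (i + 1)}
            for t in self._trackers:
                t.on_iteration(state)
        for t in self._trackers:
            t.on_stop(state, "max_iter_reached")
```

Because the protocol is structural, a new backend only needs to implement the two callbacks; the learning loop itself never changes.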


Changelog
  • .gitignore
    • Added *.db to ignore database files.
  • docs/index.md
    • Updated the main documentation to highlight preemption-safe HPC operation and clean separation of control and observability.
  • docs/integrations/clearml.md
    • Added new documentation detailing the ClearML integration, including quick start, automatic logging, parallel learner comparison, and extension points.
  • docs/integrations/mlflow.md
    • Rewrote the MLflow integration documentation to introduce the MLflowTracker and deprecate the manual wiring approach.
  • docs/user-guide/tracking.md
    • Added a new user guide explaining the pluggable tracking system, TrackerBase protocol, PipelineManifest, and built-in trackers.
  • examples/integrations/mlflow/README.md
    • Removed the old MLflow integration README.
  • examples/integrations/mlflow/mlflow_rose.py
    • Removed the old manual MLflow integration example script.
  • examples/integrations/tracking/README.md
    • Added a new README to introduce the tracking examples.
  • examples/integrations/tracking/basic.py
    • Added a new example demonstrating the HPC_FileTracker for basic file-based logging.
  • examples/integrations/tracking/clearml/run_me.py
    • Added a new example showcasing the ClearML tracker with parallel ensemble UQ.
  • examples/integrations/tracking/mlflow/run_me_tracker.py
    • Added a new example demonstrating the MLflow tracker with an adaptive kernel schedule.
  • mkdocs.yml
    • Updated the documentation navigation to include the new tracking user guide and ClearML integration.
  • pyproject.toml
    • Added optional dependencies for mlflow, clearml, and a combined tracking extra.
  • rose/__init__.py
    • Exported PipelineManifest and TrackerBase for public API access.
  • rose/al/active_learner.py
    • Modified the start method to notify registered trackers at iteration completion and loop exit.
  • rose/integrations/clearml_tracker.py
    • Added a new ClearMLTracker class implementing the TrackerBase protocol for ClearML integration.
  • rose/integrations/mlflow_tracker.py
    • Added a new MLflowTracker class implementing the TrackerBase protocol for MLflow integration.
  • rose/learner.py
    • Implemented the add_tracker method, _build_pipeline_manifest for pipeline introspection, and _notify_trackers_iteration/_notify_trackers_stop for dispatching events.
  • rose/rl/reinforcement_learner.py
    • Modified the start method to notify registered trackers at iteration completion and loop exit.
  • rose/tracking.py
    • Added a new module defining the TrackerBase protocol, TaskManifest, CriterionManifest, and PipelineManifest dataclasses.
  • tests/integration/tracking/test_clearml_tracker.py
    • Added new integration tests to verify the functionality of ClearMLTracker.
  • tests/integration/tracking/test_mlflow_tracker.py
    • Added new integration tests to verify the functionality of MLflowTracker.
  • tests/unit/tracking/test_tracker_core.py
    • Added new unit tests for the TrackerBase protocol and manifest data structures.
  • tests/unit/tracking/test_tracker_interface.py
    • Added new unit tests to verify the add_tracker and notification mechanisms within the learner.
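The changelog mentions an `HPC_FileTracker` that writes append-only JSON Lines for preemption safety. A minimal sketch of that idea follows; the class internals are assumptions, not the actual ROSE implementation. The key property is that each event is one self-contained JSON line written in append mode, so a preempted job loses at most the last partially written line.

```python
import json
from pathlib import Path


class FileTracker:
    """Illustrative append-only JSON Lines event logger."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def _append(self, record: dict) -> None:
        # Open in append mode per event so no buffered state is lost on preemption.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def on_iteration(self, state: dict) -> None:
        self._append({"event": "iteration", **state})

    def on_stop(self, final_state, reason: str) -> None:
        self._append({"event": "stop", "reason": reason})
```

Recovery after preemption is then just re-reading the file line by line and skipping any trailing line that fails to parse.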

@AymenFJA
Collaborator Author

@gemini-code-assist review


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-designed and comprehensive experiment tracking system to ROSE, featuring a new TrackerBase protocol and integrations for file-based logging, MLflow, and ClearML. While the design, documentation, and test coverage are strong, a critical security vulnerability has been identified: a high-severity insecure deserialization issue in example scripts due to pickle usage on predictable file paths in /tmp. Additionally, there's a medium-severity concern regarding potential leakage of sensitive information from task decorators to tracking services. Addressing these security concerns is paramount, and further areas for improvement are detailed in specific comments.


def load_state() -> dict:
    with open(DATA_FILE, "rb") as f:
        return pickle.load(f)


security-high

The use of pickle.load() on a file located in a world-writable directory (/tmp) with a predictable filename is insecure. A local attacker could place a malicious pickle file at this location to achieve arbitrary code execution when this example script is run. This is particularly risky in shared environments like HPC clusters. Consider using a safer serialization format like JSON or ensuring the file is created with secure permissions and a non-predictable name.
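The remediation the reviewer recommends, swapping pickle for JSON, can look like the sketch below. The function signatures and path handling are illustrative, not the example script's actual code; the point is that `json.loads` never executes code, unlike `pickle.load` on attacker-controlled bytes.

```python
import json
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Persist state as JSON; only plain data, no executable payloads."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state), encoding="utf-8")


def load_state(path: Path) -> dict:
    # Parsing JSON cannot trigger code execution, so a tampered file
    # fails loudly with a parse error instead of running attacker code.
    return json.loads(path.read_text(encoding="utf-8"))
```

Placing the file under a per-user directory (e.g. below the user's home) rather than a predictable name in /tmp also removes the world-writable-directory attack surface.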

def load_state(name: str) -> dict:
    path = Path(str(DATA_FILE).format(name=name))
    with open(path, "rb") as f:
        return pickle.load(f)


security-high

The use of pickle.load() on a file located in a world-writable directory (/tmp) with a predictable filename is insecure. A local attacker could place a malicious pickle file at this location to achieve arbitrary code execution when this example script is run. Consider using a safer serialization format like JSON or ensuring the file is created with secure permissions and a non-predictable name.


def load_state() -> dict:
    with open(DATA_FILE, "rb") as f:
        return pickle.load(f)


security-high

The use of pickle.load() on a file located in a world-writable directory (/tmp) with a predictable filename is insecure. A local attacker could place a malicious pickle file at this location to achieve arbitrary code execution when this example script is run. Consider using a safer serialization format like JSON or ensuring the file is created with secure permissions and a non-predictable name.

Comment on lines +68 to +69
for k, v in task_manifest.decor_kwargs.items():
    params[f"task/{task_key}/{k}"] = v


security-medium

The tracker logs all keyword arguments from task decorators (decor_kwargs) to ClearML. If a user passes sensitive information (e.g., API keys, tokens) to a decorator, it will be logged and stored in the tracking service. It is recommended to filter out common sensitive keys before logging.

Suggested change

 for k, v in task_manifest.decor_kwargs.items():
-    params[f"task/{task_key}/{k}"] = v
+    if k.lower() not in {"api_key", "password", "secret", "token"}:
+        params[f"task/{task_key}/{k}"] = v

Comment on lines +64 to +65
for k, v in task_manifest.decor_kwargs.items():
    params[f"task.{task_key}.{k}"] = v


security-medium

The tracker logs all keyword arguments from task decorators (decor_kwargs) to MLflow. If a user passes sensitive information (e.g., API keys, tokens) to a decorator, it will be logged and stored in the tracking service. It is recommended to filter out common sensitive keys before logging.

Suggested change

 for k, v in task_manifest.decor_kwargs.items():
-    params[f"task.{task_key}.{k}"] = v
+    if k.lower() not in {"api_key", "password", "secret", "token"}:
+        params[f"task.{task_key}.{k}"] = v

Comment on lines +127 to +132
def on_stop(self, final_state, reason: str) -> None:
    super().on_stop(final_state, reason)
    if final_state and reason in ("criterion_met", "max_iter_reached"):
        model = load_model(final_state.get("checkpoint_path"))
        mlflow.sklearn.log_model(model, artifact_path="surrogate_model")


medium

The example for extending MLflowTracker has a small bug. It calls super().on_stop() before logging the model artifact. The base on_stop method calls mlflow.end_run(), which terminates the MLflow run. Any subsequent calls to log artifacts will either fail or start a new, separate run.

To ensure all logging happens within the same active run, the super().on_stop() call should be moved to the end of the method, after the artifact has been logged. This pattern is correctly used in the ClearMLTracker extension example.

Suggested change

 class MLflowArtifactTracker(MLflowTracker):
     def on_stop(self, final_state, reason: str) -> None:
-        super().on_stop(final_state, reason)
         if final_state and reason in ("criterion_met", "max_iter_reached"):
             model = load_model(final_state.get("checkpoint_path"))
             mlflow.sklearn.log_model(model, artifact_path="surrogate_model")
+        super().on_stop(final_state, reason)

Comment on lines +513 to +519
_stop_reason = "max_iter_reached"
try:
    async for state in _stream_parallel([make_run_fn(i) for i in range(parallel_learners)]):
        self._notify_trackers_iteration(state)
        yield state
finally:
    self._notify_trackers_stop(self._iteration_state, _stop_reason)


medium

The _stop_reason for the ParallelActiveLearner is initialized to "max_iter_reached" and is never updated. If the user breaks out of the async for loop, the finally block will execute and _notify_trackers_stop will be called with the incorrect reason. It should report "stopped" in this case.

A better pattern would be to default the reason to "stopped" and only update it to "max_iter_reached" if the loop completes fully.

Suggested change

-_stop_reason = "max_iter_reached"
+_stop_reason = "stopped"
 try:
     async for state in _stream_parallel([make_run_fn(i) for i in range(parallel_learners)]):
         self._notify_trackers_iteration(state)
         yield state
+    _stop_reason = "max_iter_reached"
 finally:
     self._notify_trackers_stop(self._iteration_state, _stop_reason)

Comment on lines +937 to +943
_stop_reason = "max_iter_reached"
try:
    async for state in _stream_parallel([make_run_fn(i) for i in range(parallel_learners)]):
        self._notify_trackers_iteration(state)
        yield state
finally:
    self._notify_trackers_stop(self._iteration_state, _stop_reason)


medium

The _stop_reason for the ParallelReinforcementLearner is initialized to "max_iter_reached" and is never updated. If the user breaks out of the async for loop, the finally block will execute and _notify_trackers_stop will be called with the incorrect reason. It should report "stopped" in this case.

A better pattern would be to default the reason to "stopped" and only update it to "max_iter_reached" if the loop completes fully.

Suggested change

-_stop_reason = "max_iter_reached"
+_stop_reason = "stopped"
 try:
     async for state in _stream_parallel([make_run_fn(i) for i in range(parallel_learners)]):
         self._notify_trackers_iteration(state)
         yield state
+    _stop_reason = "max_iter_reached"
 finally:
     self._notify_trackers_stop(self._iteration_state, _stop_reason)

Comment on lines +573 to +579
_stop_reason = "max_iter_reached"
try:
    async for state in _stream_parallel([make_run_fn(name) for name in learner_names]):
        self._notify_trackers_iteration(state)
        yield state
finally:
    self._notify_trackers_stop(self._iteration_state, _stop_reason)


medium

The _stop_reason for the ParallelUQLearner is initialized to "max_iter_reached" and is never updated. If the user breaks out of the async for loop, the finally block will execute and _notify_trackers_stop will be called with the incorrect reason. It should report "stopped" in this case.

A better pattern would be to default the reason to "stopped" and only update it to "max_iter_reached" if the loop completes fully.

Suggested change

-_stop_reason = "max_iter_reached"
+_stop_reason = "stopped"
 try:
     async for state in _stream_parallel([make_run_fn(name) for name in learner_names]):
         self._notify_trackers_iteration(state)
         yield state
+    _stop_reason = "max_iter_reached"
 finally:
     self._notify_trackers_stop(self._iteration_state, _stop_reason)

1-Isolate the tracked parameters from the decorator kwargs and use log_params
2-Fix examples to be more realistic
3-Update tests
4-Test with the MLflow and ClearML UIs
1-Fix 1 — ClearML series names: ClearMLTracker adds an optional learner_names parameter, improves series naming, and fixes a bug where learner_id=0 was incorrectly mapped to value.
2-Fix 2 — Non-numeric config logging: Add string config logging for MLflow (via tags) and ClearML (via connected hyperparameters) and move _TASK_NAMES to module level to avoid recreating it each iteration.
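Fix 2 above routes numeric config to metrics/params and string config to tags (MLflow) or connected hyperparameters (ClearML). The split itself is backend-independent; a sketch of that partitioning logic follows. The helper name `split_config` is hypothetical, and it is written in pure Python so it runs without MLflow or ClearML installed.

```python
from numbers import Number


def split_config(config: dict) -> tuple[dict, dict]:
    """Partition a flat config dict: numeric values become params,
    everything else is stringified into tags."""
    params, tags = {}, {}
    for key, value in config.items():
        if isinstance(value, bool):
            # bool subclasses int, so check it first and treat flags as tags.
            tags[key] = str(value)
        elif isinstance(value, Number):
            params[key] = value
        else:
            tags[key] = str(value)
    return params, tags


# With MLflow, the two halves would feed mlflow.log_params(params) and
# mlflow.set_tags(tags); with ClearML, the string side goes through the
# task's connected hyperparameters.
```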
@AymenFJA AymenFJA merged commit 5db6c20 into main Mar 12, 2026
10 checks passed
@AymenFJA AymenFJA deleted the feature/experiment_tracking branch March 12, 2026 13:48